This paper provides an overview of NVIDIA NeMo's neural machine translation systems for the constrained data track of the WMT21 News and Biomedical Shared Translation Tasks. Our news task submissions for English-German (EN-DE) and English-Russian (EN-RU) are built on top of a baseline Transformer-based sequence-to-sequence model. Specifically, we use 1) checkpoint averaging, 2) model scaling, 3) data augmentation with backtranslation and knowledge distillation from right-to-left factorized models, 4) fine-tuning on test sets from previous years, 5) model ensembling, 6) shallow fusion decoding with Transformer language models, and 7) noisy channel re-ranking. Additionally, our biomedical task submission for English-Russian uses a biomedically biased vocabulary and is trained from scratch on news task data, medically relevant text curated from the news task dataset, and the biomedical data provided by the shared task. Our news system achieves a SacreBLEU score of 39.5 on the WMT'20 EN-DE test set, outperforming last year's best submission of 38.8. Our biomedical task RU-EN and EN-RU systems reach BLEU scores of 43.8 and 40.3 respectively on the WMT'20 biomedical task test set, outperforming the previous year's best submissions.
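The first technique the abstract lists, checkpoint averaging, can be illustrated with a minimal sketch: the parameter tensors of several saved checkpoints are averaged element-wise into a single model. The function name, the dictionary-of-arrays checkpoint layout, and the NumPy setup below are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def average_checkpoints(checkpoints):
    """Element-wise mean of parameter tensors across checkpoints.

    `checkpoints` is a list of dicts mapping parameter names to arrays
    (a hypothetical stand-in for real framework state dicts).
    """
    averaged = {}
    for name in checkpoints[0]:
        averaged[name] = np.mean([ckpt[name] for ckpt in checkpoints], axis=0)
    return averaged

# Two toy checkpoints saved at different points in training.
ckpts = [
    {"w": np.array([1.0, 2.0]), "b": np.array([0.0])},
    {"w": np.array([3.0, 4.0]), "b": np.array([2.0])},
]
avg = average_checkpoints(ckpts)
```

In practice the averaged checkpoints are usually the last few saved during training, which tends to smooth out noise from the final optimizer steps.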
Increasing the size of a neural network typically improves accuracy but also increases the memory and compute requirements for training the model. We introduce methodology for training deep neural networks using half-precision floating point numbers, without losing model accuracy or having to modify hyperparameters. This nearly halves memory requirements and, on recent GPUs, speeds up arithmetic. Weights, activations, and gradients are stored in IEEE half-precision format. Since this format has a narrower range than single-precision, we propose three techniques for preventing the loss of critical information. Firstly, we recommend maintaining a single-precision copy of weights that accumulates the gradients after each optimizer step (this copy is rounded to half-precision for the forward- and back-propagation). Secondly, we propose loss-scaling to preserve gradient values with small magnitudes. Thirdly, we use half-precision arithmetic that accumulates into single-precision outputs, which are converted to half-precision before storing to memory. We demonstrate that the proposed methodology works across a wide variety of tasks and modern large-scale (exceeding 100 million parameters) model architectures, trained on large datasets.